I saw the populous sea, saw dawn and dusk, saw the multitudes of the Americas, saw a silvery spiderweb at the center of a black pyramid, saw a broken labyrinth (it was London), saw endless eyes, all very close, studying themselves in me as though in a mirror, saw all mirrors on the planet (and none of them reflecting me), saw in a rear courtyard on Calle Soler the same tiles I’d seen twenty years before in the entryway of a house in Fray Bentos, saw clusters of grapes, snow, veins of metal, water vapor, saw convex equatorial deserts and their every grain of sand…
Jorge Luis Borges, The Aleph
In RNA-seq analysis, one often ends up with a list of differentially expressed genes. The next question is often "what do we do with all of this?" And the answer is typically some sort of functional enrichment analysis. What you get out of this is a set of terms associated with biological context, be it GO terms or pathways.
The problem is that this list of terms can itself be overwhelming. So what happens when we do functional enrichment analysis on a deluge of DEGs and end up with a deluge of terms? Here, we will use the lesser-known "embedding" feature of some LLMs to group these terms thematically on a visual map.
In this markdown, we are going to run gprofiler on example data. We will then make a "map" of the results by passing the output through a contextual embedding model, running UMAP on the embeddings, and then making an interactive visualization using plotly. Again, the purpose of this is to make a deluge of gprofiler output (or output from any type of analysis that produces GO terms, pathways, etc.) a bit easier to understand.
Specifically, this map will group terms based on context. For example, if there are a number of terms around brain development, then those will be placed near each other on the map.
We first need to get an example gene list. So let’s get started.
For this analysis, we will use g:Profiler. We will first run it using the gprofiler2 package in R. This will take in differentially expressed genes (DEGs) as input, and return statistically enriched annotation terms, from GO terms to pathways.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gprofiler2)
set.seed(42)
Ok, and let's get some data. The sample genes below were taken from the g:Profiler website, by clicking on "random example" on the linked page.
sample_genes <- c(
"ENSG00000158055", "ENSG00000116539", "ENSG00000180340", "ENSG00000100600",
"ENSG00000136518", "ENSG00000141646", "ENSG00000129173", "ENSG00000073146",
"ENSG00000166337", "ENSG00000158445", "ENSG00000197046", "ENSG00000134250",
"ENSG00000106078", "ENSG00000121454", "ENSG00000018408", "ENSG00000106689",
"ENSG00000135439", "ENSG00000164082", "ENSG00000163913", "ENSG00000113456",
"ENSG00000138080", "ENSG00000152661", "ENSG00000145864", "ENSG00000055917",
"ENSG00000125398", "ENSG00000170175", "ENSG00000159082", "ENSG00000186090",
"ENSG00000152583", "ENSG00000164171", "ENSG00000101958", "ENSG00000259207",
"ENSG00000069849", "ENSG00000145147", "ENSG00000109158", "ENSG00000141655",
"ENSG00000099337", "ENSG00000154736", "ENSG00000125965", "ENSG00000126583"
)
gost_out <- gprofiler2::gost(query = sample_genes, organism = "hsapiens", evcodes = TRUE)
gostres <- gost_out$result %>% as_tibble()
Now we are going to look at our results, so you can see what the output contains and what the best practices are for visualizing it.
First, let’s look at what you are given:
gostres
## # A tibble: 221 × 16
## query significant p_value term_size query_size intersection_size precision
## <chr> <lgl> <dbl> <int> <int> <int> <dbl>
## 1 query_1 TRUE 1.50e- 2 3 13 2 0.154
## 2 query_1 TRUE 1.50e- 2 3 13 2 0.154
## 3 query_1 TRUE 5.84e-11 3039 40 27 0.675
## 4 query_1 TRUE 2.77e-10 7669 40 37 0.925
## 5 query_1 TRUE 4.56e- 9 3973 40 28 0.7
## 6 query_1 TRUE 5.44e- 9 611 40 14 0.35
## 7 query_1 TRUE 2.23e- 8 5899 40 32 0.8
## 8 query_1 TRUE 2.33e- 8 1183 40 17 0.425
## 9 query_1 TRUE 3.14e- 8 6453 40 33 0.825
## 10 query_1 TRUE 1.09e- 7 617 40 13 0.325
## # ℹ 211 more rows
## # ℹ 9 more variables: recall <dbl>, term_id <chr>, source <chr>,
## # term_name <chr>, effective_domain_size <int>, source_order <int>,
## # parents <list>, evidence_codes <chr>, intersection <chr>
names(gostres)
## [1] "query" "significant" "p_value"
## [4] "term_size" "query_size" "intersection_size"
## [7] "precision" "recall" "term_id"
## [10] "source" "term_name" "effective_domain_size"
## [13] "source_order" "parents" "evidence_codes"
## [16] "intersection"
So you get a data frame where each row is a term. Importantly, you get a p-value, and (beyond the scope of this markdown, but I use it) if you set evcodes to TRUE, you get a column called "intersection" that lists the genes in your input that belong to the given term.
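As a quick illustration of how that intersection column can be used (a sketch on a toy tibble, since the real term names and gene sets depend on your query), you can split the comma-separated strings into one row per gene with tidyr::separate_rows():

```r
library(tidyverse)

# Toy stand-in for the gost() result: two terms, each with the
# comma-separated "intersection" column you get when evcodes = TRUE.
# The term names are placeholders; the gene IDs come from our input list.
toy <- tibble(
  term_name    = c("term A", "term B"),
  intersection = c("ENSG00000158055,ENSG00000116539",
                   "ENSG00000116539,ENSG00000180340,ENSG00000100600")
)

# Split into one row per (term, gene) pair
long <- toy %>% separate_rows(intersection, sep = ",")
print(long)
```

From the long format, it is then one group_by() away to ask, for example, which genes show up in the most terms.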
Ok, and are there any good built-in visualizations for this? Yes. Here is the one that I use.
gostplot(gost_out)
You can see that this is an interactive plot (probably plotly or something similar running in the backend) where, if you hover over a point, you get the term (grouped by source on the x-axis) and its p-value (also shown on the y-axis after a -log10 transform).
However, to my eyes, this is still a bit much. I'm glad to see the terms grouped by category (GO versus KEGG, etc.). But I'd like to see the data points grouped by context, regardless of category. So we need to make another visualization. This is where we can leverage the rapid developments in LLMs, and specifically the lesser-known application of creating embeddings rather than chatbots.
So now it's time to set up our contextual embedding step. We are going to use a model similar to BERT. In contrast to the GPTs, our model, like BERT, produces a high-dimensional embedding rather than a next-token prediction. We note that GPT-style models can produce embeddings as well, but that is not their primary use case at the time of writing.
In short, this is going to take in our results above and produce a contextual embedding that we will eventually be able to visualize.
To get access to our model, we can use the sentence_transformers Python package. But this is an R Markdown, so while I typically use Python directly for this type of work, here I am going to use the reticulate package to do it.
I have already set up a virtual environment, which I activate below.
library(reticulate)
use_virtualenv("your_venv_here", required = TRUE)
We test python code here:
py_run_string("print('hello world')")
## hello world
And test python code with imports below:
np <- import("numpy")
x <- np$array(c(1, 2, 3, 4, 5))
print(x)
## [1] 1 2 3 4 5
print(np$mean(x))
## [1] 3
And now we get down to business. Below, we import the sentence_transformers package, using all-mpnet-base-v2 as our model. It will take the gprofiler output that we made and convert each term into a 768-dimensional vector, whereby terms that are contextually similar to each other end up near each other in that 768-dimensional vector space.
When you run the code below for the first time, it will take a few minutes, because the model has to download.
# Import the necessary modules
st <- import("sentence_transformers")
# Load the model
model <- st$SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
# Create a function to encode sentences
encode_sentences <- function(sentences) {
# Convert sentences to a Python list
py_sentences <- r_to_py(sentences)
# Encode the sentences
embeddings <- model$encode(py_sentences)
# Convert the result back to an R matrix
return(py_to_r(embeddings))
}
# Example usage
embeddings <- encode_sentences(gostres$term_name)
# Print the shape of the embeddings
print(dim(embeddings))
## [1] 221 768
Ok, it looks like it worked. Let’s look at what the model output looks like:
as_tibble(embeddings)
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## # A tibble: 221 × 768
## V1 V2 V3 V4 V5 V6 V7 V8
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0403 -0.0777 0.0356 0.0185 0.00150 0.0220 0.0337 -0.0575
## 2 0.0490 -0.0856 0.0346 0.0226 -0.0000313 0.0234 0.0331 -0.0566
## 3 -0.00649 0.0371 -0.00633 -0.0310 0.0373 0.00806 0.0660 -0.0400
## 4 -0.0214 -0.0380 0.00458 -0.0374 -0.0148 -0.00459 0.0334 -0.0257
## 5 0.0280 -0.0105 -0.0462 0.00284 0.00378 -0.000652 0.0232 -0.0536
## 6 -0.0351 -0.00546 0.00257 -0.0401 0.00800 -0.0188 0.0291 -0.0282
## 7 -0.0149 0.0348 -0.0404 0.0204 0.0174 0.00283 0.0723 -0.0114
## 8 -0.0407 -0.0215 -0.0127 -0.00435 -0.0246 -0.00159 0.0321 -0.0445
## 9 -0.00593 -0.000992 -0.0213 -0.0147 -0.0110 -0.00728 0.00382 -0.00579
## 10 -0.0548 0.00839 -0.0163 -0.0192 -0.00201 -0.0166 0.0463 -0.0332
## # ℹ 211 more rows
## # ℹ 760 more variables: V9 <dbl>, V10 <dbl>, V11 <dbl>, V12 <dbl>, V13 <dbl>,
## # V14 <dbl>, V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>,
## # V20 <dbl>, V21 <dbl>, V22 <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>,
## # V26 <dbl>, V27 <dbl>, V28 <dbl>, V29 <dbl>, V30 <dbl>, V31 <dbl>,
## # V32 <dbl>, V33 <dbl>, V34 <dbl>, V35 <dbl>, V36 <dbl>, V37 <dbl>,
## # V38 <dbl>, V39 <dbl>, V40 <dbl>, V41 <dbl>, V42 <dbl>, V43 <dbl>, …
You can see it's 768 columns of numbers that don't have any direct meaning to us, but they do group the rows contextually.
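If you want to convince yourself that "contextually similar" really translates into "nearby in the vector space," you can compute the cosine similarity between rows of the embedding matrix. Below is a minimal sketch with a hand-rolled cosine function and toy 3-dimensional vectors; the term names in the comments are hypothetical, and on the real data you would compare rows of the 221 x 768 embeddings matrix instead.

```r
# Cosine similarity between two vectors: near 1 means the vectors
# point in the same direction (similar context), near 0 means unrelated
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy 3-dimensional "embeddings" (hypothetical term names in comments)
brain_dev_1 <- c(0.9, 0.1, 0.0)  # e.g. "brain development"
brain_dev_2 <- c(0.8, 0.2, 0.1)  # e.g. "central nervous system development"
metabolism  <- c(0.0, 0.2, 0.9)  # e.g. "lipid metabolic process"

cosine_sim(brain_dev_1, brain_dev_2)  # high: contextually similar
cosine_sim(brain_dev_1, metabolism)   # low: contextually different

# On the real data, compare rows of the embedding matrix, e.g.:
# cosine_sim(embeddings[1, ], embeddings[2, ])
```

This pairwise similarity is exactly the kind of structure that UMAP will try to preserve when we flatten the space to two dimensions below.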
But how can we make use of this, given that we can't operate in 768-dimensional space? We can use dimensionality reduction tools to compress the information down to two dimensions.
To this end, we will run UMAP below on the embeddings.
library(umap)
dimr <- umap::umap(embeddings)$layout %>% as_tibble()
names(dimr) <- c("umap1", "umap2")
And now we will do a quick ggplot of this as a sanity check.
ggplot(dimr, aes(x = umap1, y = umap2)) + geom_point()
So there are our points. But this means nothing to us right now, and adding the labels directly to the map above makes it very crowded.
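One middle-ground sketch, assuming you have the ggrepel package installed, is to label only the most significant terms so that a static plot stays readable. The toy data frame below stands in for our UMAP output.

```r
library(ggplot2)
library(ggrepel)  # assumption: ggrepel is installed

# Toy stand-in for our dimr tibble (UMAP coordinates plus metadata)
set.seed(42)
toy <- data.frame(
  umap1 = rnorm(50),
  umap2 = rnorm(50),
  neg_log10_p = runif(50, 1, 12),
  name = paste("term", 1:50)
)

# Keep labels for only the 10 most significant terms;
# geom_text_repel() nudges the labels so they do not overlap
top_terms <- toy[order(-toy$neg_log10_p), ][1:10, ]

ggplot(toy, aes(x = umap1, y = umap2)) +
  geom_point(color = "grey60") +
  geom_text_repel(data = top_terms, aes(label = name), size = 3)
```

That works for a figure in a paper, but for exploration we want to see every term on demand, which brings us to the interactive version.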
So we are going to do what gprofiler did and make an interactive plot. We will use plotly to this end. I typically use plotly within Python, but there is an R package for it, which we use below.
The syntax is a bit dicey, but this is where the LLMs come in handy: giving you the proper syntax for a syntax-heavy package. Anyway, the code below gets the job done.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# Calculate negative log10 of p-value
dimr$neg_log10_p <- -log10(gostres$p_value)
dimr$category <- gostres$source
dimr$name <- gostres$term_name
# Create a function to scale the sizes
scale_sizes <- function(x, min_size = 5, max_size = 30) {
(x - min(x)) / (max(x) - min(x)) * (max_size - min_size) + min_size
}
# Create the plot
p <- plot_ly(dimr,
x = ~umap1,
y = ~umap2,
type = "scatter",
mode = "markers",
marker = list(
size = ~scale_sizes(neg_log10_p),
sizemode = "area",
sizeref = 0.05,
sizemin = 4
),
color = ~category, # color points by term source (GO, REAC, etc.)
hoverinfo = "text",
text = ~paste("Name: ", name,
"<br>Source: ", category,
"<br>-log10(p-value): ", round(neg_log10_p, 2))) %>%
layout(
title = "Contextual embeddings of terms",
legend = list(title = list(text = "Category")),
showlegend = TRUE
)
# Add a dummy (empty, transparent) trace so marker size gets a legend entry
p <- p %>% add_trace(
x = numeric(0),
y = numeric(0),
type = "scatter",
mode = "markers",
marker = list(
size = 10,
color = "rgba(0,0,0,0)"
),
showlegend = TRUE,
legendgroup = "size",
name = "Size (-log10 p-value)",
hoverinfo = "none"
)
p
And now we have an interactive plot that visualizes our results. You can hover over a point to get the p-value and term name. It is colored by category (e.g. GO, KEGG, etc.).
Hover over it for a minute to convince yourself that the points are grouped thematically, and you'll see the utility of having a visualization like this in your post-gprofiler analysis.
This is a nice supplement to the built-in gostplot, in that terms that are contextually similar to each other are placed near each other on the map, giving you another lens through which to look at an overwhelming number of terms.
The problem I solved above is a subset of a much bigger problem that really started with the beginning of the -omics era: what do I do with all this data?
It used to be that the more data you had, the better. I remember this to be the case in the first biology lab I worked in, where experiments were primarily western blots and 2-color immunofluorescence (not including DAPI). But now, every experiment gives you a firehose of data.
The problem of what to do with DEGs has been around since the beginnings of bulk RNA-seq (though it is still an issue for single-cell). I remember first seeing a talk where GO terms were used and still being overwhelmed by them.
Accordingly, the advent of LLMs and advancements in AI are finally allowing for another set of solutions for dealing with all this data. The most popular one right now (or at least the loudest in the media) is the chatbot. There are of course applications here in terms of, for example, taking a list of DEGs (control vs disease, or genes that distinguish cluster X from not-cluster X) as a prompt and seeing what the chatbot gives you.
The current problem with this kind of application is of course hallucination or confabulation (I prefer the latter term). You don't really know what is correct and what is not, although the more recent chain-of-thought reasoning models are lowering the probability of this kind of error.
My approach, using an embedding, is more about contextual similarity than factual text. This embedding-based approach gives you output that is generated by the model, but rather than spoon-feeding you an answer, it converts an overwhelming list into a map that is easier for the human viewer to reason about.
While we should still iterate on improving the chatbots (and that is not going to stop any time soon), I think we should also remember these embeddings as a solution to the “I’m overwhelmed by data” problem.
I will note here that a lot of the work on single-cell foundation models involves the development of embeddings as well, so there is a theme here that the reader should be aware of. For more on that, have a look at my work on the Universal Cell Embeddings (UCE) model, here.